Project Summary and Aims

The two core analysis aims are:

to do: add in text from Heather’s outline

Overview of Dataset Preparation

**to finish

Initially, the data was matched using an extract from the Australian Honours List, which was then matched to Wikidata and Wikipedia (see section below - TO DO - ADD LINK).

Once the initial analysis was done, Alex Lum assisted by providing an extract of Wikipedia pages where an order was stated on the page. This led to the Order information being merged back into the Wikidata page, which allowed us to extract further Wikipedia pages. No additional pages have been created in Wikipedia, but we have been able to get a better

Honours Data Set

The Department of the Prime Minister and Cabinet publishes a list of Australian Honours recipients. This list includes all recipients of the Order of Australia.

Records were extracted from this database for all Order of Australia awards issued since 1975, at the following award levels:

  1. Dame of the Order of Australia
  2. Knight of the Order of Australia
  3. Companion (AC)
  4. Officer (AO)
  5. Member (AM)
  6. Medal (OAM)

More information about the Order of Australia can be found here: https://en.wikipedia.org/wiki/Order_of_Australia.

While the majority of recipients appear only once, some individuals have been awarded multiple Orders of Australia. In the analysis shown below, every figure that references the Honours data set represents the number and type of awards issued. The number of awards in our data set is XXXXinsert here###. These awards have been given to XXX individuals. A summary of the number of awards given to individuals is as follows:

(show summary of number of awards by number of people)
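
A tally of this kind could be produced along the following lines. This is a minimal Python sketch, not the code used for this report. It assumes (hypothetically) that an individual can be identified by the combination of GazetteGivenName and GazetteSurname, which will miscount different people who share a name. The sample rows are invented for illustration.

```python
from collections import Counter

def awards_per_person(rows):
    """Return {number_of_awards: number_of_people}."""
    # Count how many award rows each (given name, surname) pair has.
    per_person = Counter(
        (r["GazetteGivenName"], r["GazetteSurname"]) for r in rows
    )
    # Count how many people hold 1 award, 2 awards, and so on.
    return dict(Counter(per_person.values()))

# Invented sample rows; column names follow the honours data set.
sample_rows = [
    {"GazetteGivenName": "Jane", "GazetteSurname": "CITIZEN", "AwardAbbr": "AM"},
    {"GazetteGivenName": "Jane", "GazetteSurname": "CITIZEN", "AwardAbbr": "AO"},
    {"GazetteGivenName": "John", "GazetteSurname": "EXAMPLE", "AwardAbbr": "OAM"},
]
summary = awards_per_person(sample_rows)
```

Here `summary` maps the number of awards to the number of people holding that many awards.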

Matching the honours data set with Wikimedia information

expand on process here

  1. extract data from Wikidata, including Wikidata URL/ID and Wikipedia URL
  2. get Wikipedia page ID for all Wikipedia articles
  3. use ID to extract page creation date

Exploratory analysis

Overview of Order of Australia honours

  1. How many have been awarded?

  2. What is the breakdown by state?

How many Wikipedia pages are there for Order of Australia recipients?

  1. What proportion of Order of Australia recipients have a Wikipedia page?

  2. What are the differences by the order level?

  3. Are there any differences by recipient state?

What can we learn about the page creation date of those who have a Wikipedia page?

  1. How many had pages BEFORE or AFTER they received their Order of Australia? Is this different by order level?

  2. Does receiving an order result in a spike of Wikipedia pages being created?

  3. What is the rate of creation of pages? Have there been peaks? Has it slowed at any time?
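
The before/after question could be approached by comparing each award date with the page creation date. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, a page created on the same day as the award is counted as "after", and the sample dates are invented.

```python
from collections import Counter
from datetime import date

def page_timing(awarded_on, page_created):
    """Classify a page as created before or after the award."""
    if page_created is None:
        return "no page"
    # Assumption: a same-day creation counts as "after".
    return "before" if page_created < awarded_on else "after"

def timing_by_level(records):
    """records: iterable of (award_abbr, awarded_on, page_created)."""
    return Counter(
        (abbr, page_timing(awarded, created))
        for abbr, awarded, created in records
    )

# Invented sample records for illustration only.
sample = [
    ("AC", date(1990, 1, 26), date(2004, 5, 1)),
    ("AM", date(2015, 6, 8), date(2010, 3, 2)),
    ("OAM", date(2018, 1, 26), None),
]
tally = timing_by_level(sample)
```

People with multiple awards would need a rule for which award date to compare against; that choice is left open here.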

Random notes

These are just small things I find along the way that may not be that important, but are interesting or that I need to follow up on

  • Is there a big proportion of rugby and badminton players who have Wikidata entries? (could be good to do some analysis on this sort of thing / proportion of representation by description in Wikidata)
  • Have gender in data set - checking with Alex on whether there is a way to get a broader gender classification
  • How to handle people with multiple awards - multiple honour dates to a single page creation date
  • Need to find the entry of the nursing academic who is referenced in a Wikipedia page, but has no page of her own / has a Wikidata entry (create list of these people to have a look and see who they are)
  • I think I found a few folks that had non-English Wikipedia pages (think one was the producer of Shine? - need to check). Is this of interest? Possibly only a v small number. Could scrape the non-English pages to see where they have a page. (Also - which Australians in general have been given a page in other languages?)
  • Can I do some “bag of words” analysis on the award description and see if there are any areas that result in more page creations than others? (scientists v politicians v sports people etc? see point above about representation in description of Wikidata as well)
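
One way the “bag of words” idea in the last note could be explored is sketched below in Python. It is only an illustration: the stop-word list is a hypothetical starting point, the citations are invented, and each word is counted once per citation.

```python
import re
from collections import Counter

# Hypothetical stop words; citations tend to begin "For service to ...".
STOPWORDS = {"for", "to", "the", "and", "of", "in", "as", "a", "an",
             "service", "services"}

def tokenize(text):
    """Lower-case a citation and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def page_rate_by_word(records):
    """records: iterable of (citation, has_page).

    Returns {word: (awards_with_page, awards_total)}.
    """
    total, with_page = Counter(), Counter()
    for citation, has_page in records:
        for word in set(tokenize(citation)):  # count each word once per citation
            total[word] += 1
            if has_page:
                with_page[word] += 1
    return {w: (with_page[w], total[w]) for w in total}

# Invented citations for illustration only.
rates = page_rate_by_word([
    ("For service to medicine", True),
    ("For service to medicine and to the community", False),
    ("For service to politics", True),
])
```

Words with a high share of pages relative to their total count would point to areas that attract more page creations.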

OLD - Summary of process

  1. Extracted all Order of Australia records from The Australian Honours Search Facility
  2. Matched names of award recipients with all Wikidata records using R package wikidataR. This yielded TO DO: (insert # of records)
  3. Sorted all matches into three “buckets”
    1. all items that have some “Australianess” in their description
    2. all items that have “not-Australianess” in their description
    3. all items that have neither “Australianess” nor “not-Australianess” in their description
  4. Items in the “neither” bucket were manually checked to see if they were a match to an Order of Australia recipient and moved into either the “Australianess” bucket or the “not-Australianess” bucket
  5. Duplicate records were then extracted from the “Australianess” bucket, where a name from the Order of Australia list was matched with the names of two or more Wikidata entries. These were sorted manually by comparing the description from Wikidata with the description of the merit.
  6. Part of the “Australianess” bucket was then tested for errors. Again, the description from Wikidata was compared against the description of the award from the Order of Australia list. Of 100 entries selected, five were incorrectly matched, and these were removed.
  7. This list was then finalised, and the wikidataID was used in a scraper to extract the linked Wikipedia article and article ID from Wikidata (TO DO: need to check with Prue or Toby: is the assumption correct that if a Wikipedia page exists, there will be a Wikidata entry, AND there will be a Wikipedia page listed in the Wikidata record)
  8. The edits of each matched Wikipedia page were then scraped to extract the page creation date

To do: insert table showing tally of records for each stage

Of the 41303 records extracted from the honours database, 2828 matched a Wikidata entry with a connection to Australia.

Once this was filtered for matches to a Wikipedia page, 2474 articles remained.
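
As a rough guide to scale, these counts can be turned into match rates. The short Python sketch below uses only the three figures stated above; it treats each honours record as one unit, so people with multiple awards are not adjusted for.

```python
honours_records = 41303    # records extracted from the honours database
wikidata_matches = 2828    # matched to a Wikidata entry with a link to Australia
wikipedia_matches = 2474   # of those, matched to a Wikipedia article

def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)

wikidata_rate = pct(wikidata_matches, honours_records)     # 6.8
wikipedia_rate = pct(wikipedia_matches, honours_records)   # 6.0
page_given_entry = pct(wikipedia_matches, wikidata_matches)  # 87.5
```

That is, roughly 6.8% of records matched a Wikidata entry, about 6.0% matched a Wikipedia article, and about 87.5% of the Wikidata matches had a Wikipedia article.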

Honours Data source and Extraction

The honours data set was downloaded from The Australian Honours Search Facility published by the Department of the Prime Minister and Cabinet.

Records were extracted from this database for all Order of Australia awards issued since 1975, at the following award levels:

  1. Dame of the Order of Australia
  2. Knight of the Order of Australia
  3. Companion (AC)
  4. Officer (AO)
  5. Member (AM)
  6. Medal (OAM)

More information about the Order of Australia can be found here: https://en.wikipedia.org/wiki/Order_of_Australia

The data set contains a total of 41303 honours, with each row of the data set representing an individual award recipient.

The data variables in our set are:

##  [1] "AwardId"           "AwardedOn"         "AwardName"        
##  [4] "AwardAbbr"         "AwardSystem"       "ClaspLevel"       
##  [7] "ClaspText"         "GazetteName"       "GazetteGivenName" 
## [10] "GazetteSurname"    "GazetteSuburb"     "GazetteState"     
## [13] "GazettePostcode"   "AnnouncementEvent" "Division"         
## [16] "AdditionalInfo"    "Citation"

The first few rows of the extracted data set are shown below:

Merging Honours with Wikidata information

Names of Order of Australia recipients were then passed through Wikidata to gather more information, such as a description of the person and their aliases.

The reason for this was to ensure that recipients (who are normally listed under their full name) could also be matched to any other names they are known by. The query was run using an R package (wikidataR) that accesses the Wikidata API and matches not only the name the award was given under, but also any alias listed in a Wikidata entry. For example, Bob Hawke is listed as:

Bob Hawke

  • Description:
    • Australian politician, 23rd Prime Minister of Australia
  • Also known as:
    • Bob Hawke
    • The Honourable Bob Hawke
    • Robert James Lee “Bob” Hawke
    • Robertus Iacobus Lee Hawke (Latin)
    • Robertus Hawke (Latin)
    • 鮑勃·霍克 (Chinese)

For his AC, he was named as “Mr Robert James Lee HAWKE”. The API searches against his name and aliases to give us a match.
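
To illustrate why aliases help, the following Python sketch normalises a gazetted name and compares it against a list of aliases. It is not how the Wikidata API or wikidataR performs the search; the title list, the handling of quoted nicknames, and the function names are all assumptions made for this example.

```python
import re

# Hypothetical list of titles and honorifics to ignore when comparing names.
TITLES = {"mr", "mrs", "ms", "miss", "dr", "professor", "sir", "dame",
          "the", "honourable"}

def normalise(name):
    """Drop quoted nicknames and titles, ignore case and punctuation."""
    name = re.sub(r'["“][^"”]*["”]', " ", name)      # remove e.g. “Bob”
    words = re.findall(r"[^\W\d_]+", name.casefold())  # runs of letters only
    return " ".join(w for w in words if w not in TITLES)

def matches_alias(gazette_name, aliases):
    """True if the gazetted name equals any alias after normalisation."""
    return normalise(gazette_name) in {normalise(a) for a in aliases}

# Aliases as listed in the Wikidata entry shown above.
hawke_aliases = [
    "Bob Hawke",
    "The Honourable Bob Hawke",
    "Robert James Lee “Bob” Hawke",
    "Robertus Iacobus Lee Hawke",
    "Robertus Hawke",
    "鮑勃·霍克",
]
found = matches_alias("Mr Robert James Lee HAWKE", hawke_aliases)
```

In this sketch the gazetted name matches through the alias “Robert James Lee “Bob” Hawke” once the title and nickname are removed, whereas a comparison against the label “Bob Hawke” alone would fail.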

Matching Order of Australia recipients against Wikidata returned the following information:

The description provided in the Wikidata record often included key words such as “Australian”. Using these as a starting point, the data set was filtered to include any description mentioning “Australian” or other key words or phrases such as “Queensland”, “Tasmania”, “New South Wales” etc. Likewise, records whose descriptions contained words such as “United States”, “Dutch”, “Spanish” etc were excluded from the list in the absence of an Australian-related term.
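
The filtering rule described above can be sketched as a small Python function. The term lists here contain only the examples named in this document (the real lists were longer), and the bucket labels and function name are hypothetical.

```python
# Example terms only; the full key word lists are not reproduced here.
AUSTRALIAN_TERMS = ("australian", "queensland", "tasmania", "new south wales")
OTHER_TERMS = ("united states", "dutch", "spanish", "italian")

def bucket(description):
    """Assign a Wikidata description to one of three buckets."""
    text = description.casefold()
    # An Australian term wins even if another country is also mentioned.
    if any(term in text for term in AUSTRALIAN_TERMS):
        return "Australian"
    if any(term in text for term in OTHER_TERMS):
        return "not Australian"
    return "unallocated"  # left for manual checking

examples = {
    d: bucket(d)
    for d in [
        "Australian politician, 23rd Prime Minister of Australia",
        "Italian-born Australian artist",
        "Italian-born artist",
        "medical researcher",
    ]
}
```

Checking the Australian terms first is what keeps a description such as “Italian-born Australian artist” in the Australian bucket.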

Once this filtering and matching was done based on the description field, there remained a list of “unallocated” records that I sifted through manually and allocated to the Australian list if there was a match. This was done using other information contained in the Wikidata entry or in the Honours information.

A final edit was undertaken to remove cases that referred to “non-person” items such as parks, ovals, reserves, articles, discographies, filmographies, foundations etc that may have included the name of the award recipient.

The head of the data set is displayed below.

Matching names to wikipedia pages

Using the list of Australian matches, the Wikidata ID was used in a scraper to get the Wikipedia page URL and article ID from Wikidata. This gave us 2474 Wikipedia article links.
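
One way to look up the English Wikipedia link for a Wikidata ID is the Wikidata API's `wbgetentities` action. The Python sketch below only builds the request URL and reads a response that has already been parsed from JSON; it is not the scraper used here. The function names, the ID `Q123` and the sample response are invented for illustration.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def sitelink_query_url(wikidata_id):
    """Build a wbgetentities request for the English Wikipedia sitelink."""
    params = {
        "action": "wbgetentities",
        "ids": wikidata_id,
        "props": "sitelinks/urls",
        "sitefilter": "enwiki",
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)

def enwiki_link(response, wikidata_id):
    """Return (title, url) of the enwiki article, or None if there is none."""
    entity = response.get("entities", {}).get(wikidata_id, {})
    sitelinks = entity.get("sitelinks") or {}  # tolerate a missing or empty value
    link = sitelinks.get("enwiki")
    if link is None:
        return None
    return link.get("title"), link.get("url")

# Invented response in the shape returned by the API.
sample_response = {"entities": {"Q123": {"id": "Q123", "sitelinks": {"enwiki": {
    "site": "enwiki",
    "title": "Example Person",
    "url": "https://en.wikipedia.org/wiki/Example_Person",
}}}}}
link = enwiki_link(sample_response, "Q123")
```

An entry with no English Wikipedia sitelink returns `None`, which is how records with a Wikidata entry but no article would drop out at this step.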

Extracting a page creation date

Each wikipedia page match has also been linked to a page creation date.

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvlimit=1&rvprop=timestamp&rvdir=newer&pageids=2352403

The above query asks the API to sort all revisions from oldest to most recent and return the first timestamp for page ID 2352403. Using a loop, this query was run for the Wikipedia page ID of each matched article, and the timestamp was recorded.

(This query was found via a search on Stack Overflow, and I built a simple scraper to store the timestamp against the page ID.)
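
The query can be assembled and its result read as in the Python sketch below. This is an illustration rather than the scraper described above: the function names are hypothetical, `format=json` is added so the response can be parsed, and the sample response uses an invented timestamp.

```python
from urllib.parse import urlencode

WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"

def creation_query_url(page_id):
    """Build the query for the oldest revision timestamp of a page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvlimit": 1,            # only one revision is needed
        "rvprop": "timestamp",
        "rvdir": "newer",        # oldest revision first
        "pageids": page_id,
        "format": "json",        # assumption: not part of the original query
    }
    return WIKIPEDIA_API + "?" + urlencode(params)

def creation_timestamp(response, page_id):
    """Return the first revision timestamp, or None if it is missing."""
    pages = response.get("query", {}).get("pages", {})
    revisions = pages.get(str(page_id), {}).get("revisions") or []
    return revisions[0]["timestamp"] if revisions else None

# Invented response in the shape returned by the API (pages keyed by page ID).
sample_response = {"query": {"pages": {"2352403": {
    "pageid": 2352403,
    "revisions": [{"timestamp": "2005-01-01T00:00:00Z"}],
}}}}
created = creation_timestamp(sample_response, 2352403)
```

Looping over the matched page IDs and storing `created` against each ID gives the page creation date column.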

Final data merge

All data sets were then merged together, into a final data set. Cases were merged using the following method:

  1. a file was created that combined the wikidataID and the Wikipedia page ID, and included the scraped page creation date
  2. a second file was created merging the Order of Australia data set and the wikidata information, which passed on the wikidataID.
  3. the two files were merged together using wikidataID as the merge key
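
The merge in step 3 can be sketched in Python as a join on wikidataID. This sketch assumes a left join that keeps every honours row, including those without a Wikidata match; whether the original merge kept unmatched rows is not stated here. The function name and sample values are invented.

```python
def merge_on_wikidata_id(honours_rows, page_rows):
    """Left-join page details onto honours rows using wikidataID."""
    pages = {p["wikidataID"]: p for p in page_rows}
    merged = []
    for row in honours_rows:
        # Rows with no wikidataID, or no matching page, keep only their own fields.
        page = pages.get(row.get("wikidataID"), {})
        merged.append({**page, **row})
    return merged

# Invented sample rows for illustration only.
honours_rows = [
    {"AwardId": 1, "AwardAbbr": "AC", "wikidataID": "Q123"},
    {"AwardId": 2, "AwardAbbr": "OAM", "wikidataID": None},
]
page_rows = [
    {"wikidataID": "Q123", "wikipediaPageID": 111, "pageCreation": "2005-01-01"},
]
full = merge_on_wikidata_id(honours_rows, page_rows)
```

Keeping unmatched rows makes it possible to report the proportion of awards with and without a page from the same data set.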

The variables included on the full data set are:

##  [1] "AwardId"           "AwardedOn"         "AwardName"        
##  [4] "AwardAbbr"         "GazetteName"       "suburb"           
##  [7] "state"             "GazettePostcode"   "AnnouncementEvent"
## [10] "Citation"          "Gender"            "personDescription"
## [13] "refurl"            "date_awarded"      "wpURL"            
## [16] "wikipediaPageID"   "name"              "pageCreation"     
## [19] "wdData"            "wpPage"            "award"

Data set and process limitations

There are pros and cons to this method. It speeds up what would otherwise be a manual process of checking whether the matched records are Order of Australia award recipients. It also means that a record may inadvertently have been included for someone who is not an Order of Australia recipient, but whose name and text identifier (such as “Australian”, “Queensland” etc) matched.

For example, one Bob Smith has received an AM but has no Wikidata entry. A second Bob Smith does not have an AM, but has a Wikidata entry with a description that says “Australian medical researcher”. The second Bob Smith will be included in the list that is matched to the Wikipedia query, even though he has no award. If he also has a Wikipedia page, he will be included in the final data set.

If an award recipient’s description included another country but did not mention “Australia” or other Australian-related terms, the recipient will not be included in the list. For example, Jane A Smith has a Wikidata entry and has received an Order of Australia award. Her description says “Italian-born artist”. She would be excluded from our list based on the presence of “Italian” without an Australian qualifier. If Jane Smith has an Order of Australia award and is described as an “Italian-born Australian artist”, she is included on our list.

It is hypothesised that these examples are the exception rather than the rule, and that the majority of matching cases identified in the process are correct. A manual check of approx 500 cases found five match errors.
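
To give a sense of the uncertainty around that check, an error rate and an approximate 95% interval can be computed from the counts. The Python sketch below uses a Wilson score interval, which is a choice made for this example and not part of the original analysis. It assumes the checked cases were a random sample and takes the sample size as exactly 500, although the text gives it only approximately.

```python
from math import sqrt

def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 for about 95%)."""
    p = errors / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

errors, checked = 5, 500   # assumption: "approx 500" taken as exactly 500
error_rate = errors / checked
low, high = wilson_interval(errors, checked)
```

With these inputs the observed error rate is 1%, and the interval runs from a little under 0.5% to a little over 2%.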